Note: This page's design, presentation and content have been created and enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 8 • Sub-Lesson 1

👁 What Multimodal AI Can See, Hear, and Read

From text-only tools to models that see images, read documents, transcribe audio, and process hours of video — and what this means for research

What We'll Cover

The assumption that AI is a text tool is already outdated. Modern models can see images, read documents, transcribe audio, and process hours of video. This week is about understanding what these capabilities genuinely offer researchers — and where the gap between “impressive demo” and “reliable research tool” remains significant.

A key framing distinction sits at the centre of everything this week: AI reading vs. AI understanding. These are not the same thing. A model can describe a chart fluently and still get the numbers wrong. It can transcribe speech accurately and still hallucinate sentences that were never said. It can extract text from a PDF and still misread the structure of a complex table.

Each sub-lesson in Week 8 examines this distinction through a different modality. This overview sets up the landscape so you know where you are going — and why the distinction matters before you pick up any multimodal tool.

🌏 The Four Modalities

Multimodal AI refers to models that accept inputs beyond plain text. Four modalities are most relevant to research workflows in 2026. Each opens different tasks — and carries different reliability profiles.

📷 Images

Scientific figures, microscopy, satellite imagery, photographs, charts, and diagrams. Images were the first non-text modality to reach general-purpose LLMs and remain the most widely tested.

  • Describing and captioning figures for accessibility
  • Qualitative interpretation of charts and graphs
  • Comparing images for obvious visual differences
  • Satellite and geospatial imagery analysis
  • Microscopy documentation and description

Representative tools: Claude (family), GPT (family), Gemini (family)
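
As a concrete starting point, the sketch below sends one figure to a vision-capable model and asks for draft alt text. It is a minimal illustration rather than a recommended pipeline: the file name and model string are placeholders, and the call follows the Anthropic Python SDK's messages API, so check current model names before running. Whatever comes back still needs the verification habits discussed later this week.

```python
# Minimal sketch: drafting alt text for a figure with a vision-capable model.
# Assumes the Anthropic Python SDK; the model name below is a placeholder,
# so check current model names in the Anthropic docs before running.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("figure_3.png", "rb") as f:  # placeholder file name
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": image_data},
            },
            {
                "type": "text",
                "text": (
                    "Draft one-paragraph alt text for this figure. Describe only what is "
                    "visually present, and flag any values you cannot read with certainty."
                ),
            },
        ],
    }],
)

print(message.content[0].text)
```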

🎤 Audio

Interviews, focus groups, field recordings, lectures, oral histories, podcasts, and environmental sound. Audio capabilities range from transcription-only to full end-to-end audio reasoning.

  • Transcription of interviews and focus groups
  • Speaker diarisation (who said what)
  • Transcription in African and low-resource languages
  • Thematic analysis of spoken content
  • Field recording documentation

Representative tools: Whisper large-v3, GPT (family), Gemini (family), Intron Sahara
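
For the transcription tasks above, the open-source route is the openai-whisper package. The sketch below is a minimal local transcription run; the file name is a placeholder, and note that Whisper on its own does not do speaker diarisation, which requires a separate tool layered on top.

```python
# Minimal sketch: local transcription with Whisper large-v3 via the open-source
# openai-whisper package (pip install openai-whisper). File name is a placeholder.
import whisper

model = whisper.load_model("large-v3")         # downloads model weights on first run
result = model.transcribe("interview_01.wav")  # language is auto-detected by default

print(result["text"])                          # full transcript as a single string

# Time-stamped segments are useful for spot-checking against the recording.
for segment in result["segments"]:
    print(f'[{segment["start"]:7.1f}s] {segment["text"]}')
```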

📄 Documents

PDFs, scanned papers, tables, forms, supplementary data files, archival documents. Document understanding combines OCR with structural reasoning — extracting not just text but how that text is organised.

  • Table extraction from PDFs and supplementary files
  • OCR of scanned archival documents
  • Summarisation of long research papers
  • Structured data extraction from forms
  • Cross-document comparison and synthesis

Representative tools: Claude (family), Docling, LlamaParse, Azure Document Intelligence
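
For table extraction specifically, a minimal Docling run looks roughly like the sketch below. It follows the Docling quickstart pattern (DocumentConverter plus Markdown export); the file name is a placeholder, and extracted tables should always be spot-checked against the original PDF.

```python
# Minimal sketch: converting a PDF to structured text and tables with Docling
# (pip install docling). File name is a placeholder; table output should be
# spot-checked against the original document.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("supplementary_tables.pdf")

# Markdown export keeps headings, lists, and tables where Docling detects them.
markdown = result.document.export_to_markdown()
print(markdown[:2000])

# Detected tables can also be inspected individually as pandas DataFrames.
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
```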

🎥 Video

Lectures, experiments, classroom observations, field interviews, documentary footage, and recorded procedures. Video is the most demanding modality — combining visual, audio, and temporal reasoning.

  • Classroom observation analysis
  • Summarising recorded lectures or seminars
  • Experiment documentation from video recordings
  • Interview transcription with visual context
  • Long-form documentary or archival footage analysis

Representative tools: Gemini Pro tier (~1 hr standard, longer at low FPS), GPT (family, frames + audio)
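
For long recordings, the typical route is the Gemini API via the google-genai SDK. The sketch below is a minimal example under that assumption; the model name and file name are placeholders, and upload limits, frame-sampling rates, and supported durations change often enough that the current API docs are the authority.

```python
# Minimal sketch: summarising a recorded lecture with the Gemini API via the
# google-genai SDK (pip install google-genai). Model and file names are
# placeholders; check current limits and model names in the API docs.
from google import genai

client = genai.Client()  # reads the API key from the environment

video = client.files.upload(file="lecture_week8.mp4")
# Long videos take a short while to process after upload; in practice you poll
# the file state until it is ACTIVE before sending the request (omitted here).

response = client.models.generate_content(
    model="gemini-2.5-pro",  # placeholder model name
    contents=[
        video,
        "Summarise this lecture in ten bullet points with approximate timestamps.",
    ],
)

print(response.text)
```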

🔬 The Model Landscape

No single model leads on all four modalities. Understanding which model handles which input — and where native capability ends and workarounds begin — prevents both over-reliance and missed opportunity.

| Model | Developer | Images | Audio | Video | Documents | Context Window | Key Strength for Research |
|---|---|---|---|---|---|---|---|
| Claude (family) | Anthropic | ✓ (native) | ✗ (no native) | ✗ (no native) | ✓ (native PDF) | 200K tokens | Deep document reasoning, code generation alongside analysis |
| GPT (family) | OpenAI | ✓ (native) | ✓ (end-to-end) | ✓ (frames + audio) | ✓ (PDF input) | 128K tokens | Best general-purpose; very low audio latency in real-time tier; strong mixed image + text |
| Gemini (Pro tier) | Google DeepMind | ✓ (native) | ✓ (native) | ✓ (native, ~1 hr standard) | ✓ (native PDF) | 1M tokens | Best for long video/audio; ~1 hr video (standard) or ~8.5 hrs audio-only in one call (durations depend on FPS and resolution; check current API docs) |
| Whisper large-v3 | OpenAI | ✗ | ✓ (transcription only) | ✗ | ✗ | N/A | Open-source audio transcription; ~2.0% WER on clean audio |

🔔 Audio and Video for Claude Users

Claude does not currently process audio or video natively. If you are working primarily in Claude, the recommended workflow is to combine it with a transcription tool first: use Whisper large-v3 or OpenAI's transcription endpoint to convert audio or video to text, then bring that text into Claude for analysis. Alternatively, use the Gemini API for the audio/video step, then pass the transcript or summary to Claude for deeper reasoning or code generation. The tools complement each other rather than compete.
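
A minimal sketch of that two-step workflow, assuming the openai-whisper package and the Anthropic Python SDK (file and model names are placeholders), looks like this:

```python
# Minimal sketch of the transcribe-then-analyse workflow described above:
# Whisper produces the transcript, Claude does the reasoning step.
# File and model names are placeholders.
import whisper
import anthropic

transcript = whisper.load_model("large-v3").transcribe("focus_group_02.wav")["text"]

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1500,
    messages=[{
        "role": "user",
        "content": (
            "Below is a focus-group transcript produced by automatic transcription, "
            "so it may contain errors.\n\n" + transcript + "\n\n"
            "List the main themes, each with one or two short supporting quotes. "
            "Quote the transcript verbatim so I can verify against the recording."
        ),
    }],
)

print(message.content[0].text)
```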

📍 Research Use-Case Map

Different research domains have different primary modalities. This table maps common UCT research contexts to the modality and tooling most relevant to them. It is a starting point — not every cell will match your specific project.

| Research Domain | Key Modality | Example Use | Best Tool |
|---|---|---|---|
| Qualitative (interviews / focus groups) | Audio | Transcription and thematic analysis | Whisper + ATLAS.ti / NVivo |
| Quantitative (survey data in PDFs) | Documents | Table extraction from supplementary files | Docling / LlamaParse |
| Scientific publishing | Images | Figure interpretation, alt text generation | Claude (family) / GPT (family) |
| Field research (Africa, low-resource) | Audio | Transcription in African languages | Intron Sahara / Lelapa AI |
| Archival research | Documents | Scanned document OCR | Marker / Azure Doc Intelligence |
| Earth / environmental science | Images | Satellite imagery analysis | TerraTorch + Prithvi / Gemini Geospatial |
| Medical / health sciences | Images | Assisting with image description (not diagnosis) | Frontier VLM (with verification) |
| Education research | Video | Classroom observation analysis | Gemini (family) / ClassMind |

🧠 Reading vs. Understanding — The Core Distinction

This is the most important conceptual framing of the week. It applies to every modality and every tool. Understanding it before using any multimodal AI is not optional — it is the difference between using these tools safely and being misled by them.

The Reading – Understanding Gap

When an AI model “reads” an image or document, it is pattern-matching against its training data. When it “understands” that image or document, it would be reasoning about what the content means in context. Current models are genuinely impressive at the former — and frequently overconfident about the latter.

As noted in the overview above, a model can describe a chart fluently and still get the numbers wrong, transcribe speech accurately and still hallucinate sentences that were never said, and extract text from a PDF and still misread the structure of a complex table.

The distinction between reading and understanding is not philosophical — it is the practical difference between a tool you can trust and one that will quietly mislead you.

📊 Case Study: The Real-World Chart Gap

The CharXiv benchmark (NeurIPS 2024, Princeton University) tests AI on real scientific charts from actual published papers — not simplified test datasets designed for evaluation convenience. It has become the standard reference for evaluating genuine scientific chart understanding.

When the benchmark was published in 2024, GPT-4o scored 47.1% on reasoning questions versus 80.5% for humans — a gap that received considerable attention. Frontier model scores have risen substantially since (top models now approach human performance on the original benchmark). But the core finding persists in evaluations of newer models: real-world performance on actual scientific figures consistently lags performance on simplified, purpose-built benchmarks. The headline gap has narrowed; the underlying pattern has not gone away.

One mechanism: when models do struggle, they often appear to read labels, axis titles, and captions — the text surrounding the chart — rather than actually processing the chart geometry. The language decoder generates plausible-sounding descriptions that drift from what is visually present in the image. Frontier models have improved on this, but the failure mode is not fully solved.

This is the week's central finding. Each sub-lesson shows a version of it in a different modality: charts, documents, audio. The benchmark-vs-real-world gap is not a chart-specific failure. It is a general property of current multimodal AI that you need to build into your research workflows.

Source: Wang et al. (2024). “CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs.” NeurIPS 2024. arxiv.org/abs/2406.18521

⚡ A Note on Rapid Change

🕑 These Capabilities Are a Moving Target

Multimodal capabilities are advancing faster than any other area we cover in this course. Model versions change quarterly. Capability comparisons that are accurate today may not be accurate in six months. A model announced after this page was written may outperform every entry in the table above on one or more dimensions.

The skill this week is not memorising which model does what — it is developing the habit of testing claims about multimodal capability against your own research tasks before trusting them.

CharXiv is a good example: when the benchmark launched in 2024, leading models that scored 90%+ on standard chart benchmarks scored far lower on real scientific charts. Frontier models have since improved, but the underlying lesson is durable — standard benchmarks rarely capture what matters for research. When you read claims about AI performance on a new multimodal task, the first question should always be: “What was it actually tested on, and how close is that to my data?”
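
One way to make that habit concrete is a small spot-check script you run whenever a model reads values off your own figures. The sketch below is purely illustrative (the numbers are invented placeholders); the point is the workflow of comparing model-extracted values against source data you already hold.

```python
# Minimal sketch of the habit described above: before trusting a model on chart
# reading, spot-check it on a few figures where you already know the numbers.
# All values here are invented placeholders, not real benchmark data.

def spot_check(model_readings: dict, ground_truth: dict, rel_tol: float = 0.05) -> None:
    """Compare model-extracted values against known values from the source data."""
    for key, true_value in ground_truth.items():
        read_value = model_readings.get(key)
        if read_value is None:
            print(f"{key}: MISSING from model output")
            continue
        error = abs(read_value - true_value) / abs(true_value)
        status = "OK" if error <= rel_tol else "FAIL"
        print(f"{key}: model={read_value}, truth={true_value}, error={error:.1%} -> {status}")

# Values the model claimed to read off a figure, vs. your own source data file.
spot_check(
    model_readings={"treatment_mean": 4.8, "control_mean": 3.1, "n_participants": 210},
    ground_truth={"treatment_mean": 4.8, "control_mean": 3.4, "n_participants": 210},
)
```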

📚 Core Readings

Wang et al. (2024)

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

NeurIPS 2024 (Princeton University). The benchmark study that established the 47% finding. Essential reading for anyone planning to use AI with scientific figures, charts, or data visualisations. Includes a detailed analysis of why models fail — not just the benchmark scores.

  • 2,323 charts from real published papers
  • Descriptive vs. reasoning question split
  • Human baseline comparison
  • Root-cause analysis of failure modes

arxiv.org/abs/2406.18521 ↗

Rahmanzadehgervi et al. (2024)

Vision Language Models Are Blind

ACCV 2024. Tests models on tasks trivial for humans: overlapping circles, intersecting lines, counting objects in simple arrangements. Models average 58.07% on these tasks — barely better than random on some sub-tasks. Provides a mechanistic explanation for the CharXiv result: models are not seeing the geometry.

  • 7 simple geometric task categories
  • 4 state-of-the-art VLMs evaluated
  • Failure mode taxonomy
  • Implications for scientific image use

arxiv.org/abs/2407.06581 ↗

✅ Summary and What's Next

Week 8 at a Glance

This sub-lesson introduced the four modalities relevant to research — images, audio, documents, and video — and mapped them to the tools and research contexts where they matter most. The model landscape table gives you a working comparison of where the major tools are today; treat it as a snapshot, not a permanent reference.

The central conceptual frame for the week is the reading vs. understanding distinction. The CharXiv benchmark provides the most rigorous available evidence for why this matters: even as model scores have improved, real-world performance on actual scientific figures consistently lags performance on purpose-built evaluation benchmarks. The companion “VLMs Are Blind” paper explains one key mechanism: models often read surrounding text labels rather than processing image geometry.

The remaining sub-lessons go deeper into each domain:

  • Sub-Lesson 2: AI and Scientific Images — the CharXiv finding in depth, the correct-answer-wrong-reasoning problem, domain-specific tools, and bias in image recognition
  • Sub-Lesson 3: Document Intelligence — OCR, table extraction, PDF structure, and where document AI genuinely excels vs. fails
  • Sub-Lesson 4: Transcription and Audio Analysis — Whisper performance, African language support, speaker diarisation, and hallucination in audio transcription
  • Sub-Lesson 5: Video and Multimodal Workflows — Gemini's long-context video capability, temporal reasoning, and practical workflows for lecture and field recording analysis
  • Sub-Lesson 6: Hands-On Activities and Assessment — three practical exercises (figure analysis, self-recorded transcription test, document table extraction) and the weekly assessment

Each sub-lesson includes a practical workflow section — concrete steps for using that modality reliably in your own research, including how to verify outputs before trusting them.